-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-write partitioner to use ColumnChunks instead of ValueVectors #2979
Conversation
a5b30d6
to
be2b2fe
Compare
I've done some optimizations and memory usage is now ~16GB without compression on that dataset (compression on the chunks used in the partitioner has been disabled for now. It may be worthwhile to look into compressing ColumnChunks in-memory more broadly when we're accumulating intermediate values, but it just makes this change more complicated). ColumnChunk capacity was left at 2048 (same as the previous implementation using |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2979 +/- ##
==========================================
- Coverage 93.31% 93.25% -0.06%
==========================================
Files 1124 1124
Lines 42912 42934 +22
==========================================
- Hits 40042 40040 -2
- Misses 2870 2894 +24 ☔ View full report in Codecov by Sentry. |
be2b2fe
to
6151053
Compare
@@ -151,12 +151,12 @@ class BoolColumnChunk : public ColumnChunk { | |||
enableCompression, hasNullChunk) {} | |||
|
|||
void append(common::ValueVector* vector) final; | |||
void appendOne(common::ValueVector* vector, common::vector_idx_t pos) final; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine temporarily, but we should refactor the two interfaces append(common::ValueVector* vector)
and appendOne(common::ValueVector* vector, common::vector_idx_t pos)
into a single one by passing SelVector as an argument,
void append(common::ValueVector* vector, SelVector& sel);
1a79a66
to
2042c47
Compare
ValueVectors have high memory fragmentation, and allocate strings in 256KB chunks for only 2048 strings. ColumnChunks can have a much larger capacity, and also support string de-duplication.
2042c47
to
38e4398
Compare
ValueVectors have high memory fragmentation, and allocate strings in 256KB chunks for only 2048 strings.
ColumnChunks can have a much larger capacity, and also support string de-duplication.
Fixes #2863 (peak memory usage when copying the dataset was 13GB (36GB before I enabled string de-duplication in the partitioner's column chunks).
Fixes #2957
I'm not sure if we should leave string compression enabled in the Partitioner, as it can increase memory usage for datasets with little duplication. At the least it should probably follow the global compression setting (which I don't think is easily accessible to the Partitioner).